knitr::opts_chunk$set(echo = FALSE, include = TRUE, warning = FALSE, message = FALSE)
options(digits = 3)  # `scientific` is not a built-in R option; `scipen` controls the notation
# options(scipen = 9, digits = 3)

Preprocess data

We first preprocessed the raw data in Python to obtain a tidier data frame, because some columns in the raw data are stored in JSON format.

Import data

Remove NAs and duplicates

Process columns

Create columns

Company

Graph of number of movies by company

Genres

##               Var1 Freq
## 1                     0
## 2           Action  588
## 3        Adventure  288
## 4        Animation   99
## 5           Comedy  634
## 6            Crime  141
## 7      Documentary   30
## 8            Drama  745
## 9           Family   38
## 10         Fantasy   93
## 11         Foreign    1
## 12         History   18
## 13          Horror  197
## 14           Music   20
## 15         Mystery   27
## 16         Romance   70
## 17 Science Fiction   79
## 18        Thriller  118
## 19        TV Movie    0
## 20             War   18
## 21         Western   22

We drop the single movie in the Foreign genre: the genre is rare, and one observation is not enough to estimate a genre effect in our prediction.

## 'data.frame':    3225 obs. of  14 variables:
##  $ X         : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ budget    : int  237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
##  $ genres    : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
##  $ popularity: num  150.4 139.1 107.4 112.3 43.9 ...
##  $ company   : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
##  $ date      : Date, format: "2009-12-10" "2007-05-19" ...
##  $ revenue   : num  2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
##  $ runtime   : num  162 169 148 165 132 139 100 141 153 151 ...
##  $ title     : Factor w/ 3224 levels "(500) Days of Summer",..: 259 1761 2129 2420 1265 2139 2256 260 1053 310 ...
##  $ score     : num  7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
##  $ vote      : int  11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
##  $ genrecount: int  4 3 3 4 3 3 2 3 3 3 ...
##  $ profit    : num  2.55e+09 6.61e+08 6.36e+08 8.35e+08 2.41e+07 ...
##  $ profitable: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Structure of movie data

## 'data.frame':    3225 obs. of  15 variables:
##  $ budget    : int  237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
##  $ genres    : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
##  $ popularity: num  150.4 139.1 107.4 112.3 43.9 ...
##  $ company   : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
##  $ date      : Date, format: "2009-12-10" "2007-05-19" ...
##  $ revenue   : num  2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
##  $ runtime   : num  162 169 148 165 132 139 100 141 153 151 ...
##  $ title     : Factor w/ 3224 levels "(500) Days of Summer",..: 259 1761 2129 2420 1265 2139 2256 260 1053 310 ...
##  $ score     : num  7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
##  $ vote      : int  11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
##  $ profit    : num  2.55e+09 6.61e+08 6.36e+08 8.35e+08 2.41e+07 ...
##  $ profitable: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ season    : Factor w/ 4 levels "Fall","Spring",..: 4 2 1 3 2 2 1 2 3 2 ...
##  $ quarter   : chr  "Q4" "Q2" "Q4" "Q3" ...
##  $ year      : num  2009 2007 2015 2012 2012 ...

Summary of the data

summary

##     revenue             budget           popularity     runtime   
##  Min.   :5.00e+00   Min.   :1.00e+00   Min.   :  0   Min.   : 41  
##  1st Qu.:1.71e+07   1st Qu.:1.05e+07   1st Qu.: 10   1st Qu.: 96  
##  Median :5.52e+07   Median :2.50e+07   Median : 20   Median :107  
##  Mean   :1.21e+08   Mean   :4.07e+07   Mean   : 29   Mean   :111  
##  3rd Qu.:1.46e+08   3rd Qu.:5.50e+07   3rd Qu.: 37   3rd Qu.:121  
##  Max.   :2.79e+09   Max.   :3.80e+08   Max.   :876   Max.   :338  
##      score           vote           profit         
##  Min.   :2.30   Min.   :    1   Min.   :-1.66e+08  
##  1st Qu.:5.80   1st Qu.:  179   1st Qu.: 2.52e+05  
##  Median :6.30   Median :  471   Median : 2.64e+07  
##  Mean   :6.31   Mean   :  978   Mean   : 8.07e+07  
##  3rd Qu.:6.90   3rd Qu.: 1148   3rd Qu.: 9.75e+07  
##  Max.   :8.50   Max.   :13752   Max.   : 2.55e+09
##      budget               genres      popularity 
##  Min.   :1.00e+00   Drama    :745   Min.   :  0  
##  1st Qu.:1.05e+07   Comedy   :634   1st Qu.: 10  
##  Median :2.50e+07   Action   :588   Median : 20  
##  Mean   :4.07e+07   Adventure:288   Mean   : 29  
##  3rd Qu.:5.50e+07   Horror   :197   3rd Qu.: 37  
##  Max.   :3.80e+08   Crime    :141   Max.   :876  
##                     (Other)  :632                
##                company          date               revenue        
##  Others            :1636   Min.   :1916-09-04   Min.   :5.00e+00  
##  Paramount Pictures: 255   1st Qu.:1998-09-10   1st Qu.:1.71e+07  
##  Sony Pictures     : 277   Median :2005-07-20   Median :5.52e+07  
##  Universal Pictures: 338   Mean   :2002-03-18   Mean   :1.21e+08  
##  Walt Disney       : 497   3rd Qu.:2010-11-11   3rd Qu.:1.46e+08  
##  Warner Bros       : 222   Max.   :2016-09-09   Max.   :2.79e+09  
##                                                                   
##     runtime                           title          score     
##  Min.   : 41   The Host                  :   2   Min.   :2.30  
##  1st Qu.: 96   (500) Days of Summer      :   1   1st Qu.:5.80  
##  Median :107   [REC]                     :   1   Median :6.30  
##  Mean   :111   [REC]²                    :   1   Mean   :6.31  
##  3rd Qu.:121   10 Cloverfield Lane       :   1   3rd Qu.:6.90  
##  Max.   :338   10 Things I Hate About You:   1   Max.   :8.50  
##                (Other)                   :3218                 
##       vote           profit          profitable    season   
##  Min.   :    1   Min.   :-1.66e+08   0: 787     Fall  :930  
##  1st Qu.:  179   1st Qu.: 2.52e+05   1:2438     Spring:704  
##  Median :  471   Median : 2.64e+07              Summer:837  
##  Mean   :  978   Mean   : 8.07e+07              Winter:754  
##  3rd Qu.: 1148   3rd Qu.: 9.75e+07                          
##  Max.   :13752   Max.   : 2.55e+09                          
##                                                             
##    quarter               year     
##  Length:3225        Min.   :1916  
##  Class :character   1st Qu.:1998  
##  Mode  :character   Median :2005  
##                     Mean   :2002  
##                     3rd Qu.:2010  
##                     Max.   :2016  
## 

Standard deviation and variance

##    revenue     budget popularity    runtime      score       vote 
##   1.86e+08   4.44e+07   3.62e+01   2.10e+01   8.60e-01   1.41e+03 
##     profit 
##   1.58e+08
##    revenue     budget popularity    runtime      score       vote 
##   3.47e+16   1.97e+15   1.31e+03   4.40e+02   7.39e-01   2.00e+06 
##     profit 
##   2.50e+16

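The means, variances and standard deviations differ by many orders of magnitude across variables because the variables are measured on different scales. We therefore scale the data before fitting models such as linear regression, PCR and KNN. For PCR and KNN the results depend on the scale; for linear regression, scaling mainly makes the coefficients comparable.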
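As a minimal sketch of the scaling step, the base R function `scale()` centers each column to mean 0 and rescales it to standard deviation 1. The data frame `toy` and its values are made up for illustration and are not taken from the movie data.

```r
# Toy data with very different scales (values are invented for illustration)
toy <- data.frame(
  budget  = c(1e6, 5e7, 2e8, 3e7),
  runtime = c(90, 120, 150, 105),
  score   = c(5.5, 6.8, 7.9, 6.1)
)

# Center each column to mean 0 and rescale to standard deviation 1
toy_scaled <- as.data.frame(scale(toy))

sapply(toy, sd)         # standard deviations differ by orders of magnitude
sapply(toy_scaled, sd)  # every column now has standard deviation 1
```

When a train/test split is used, the centering and scaling constants are usually computed on the training set and then applied to the test set.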

Preparation

Correlation matrix

Test whether mean revenue differs across genres

##               Df  Sum Sq  Mean Sq F value Pr(>F)    
## genres        17 1.4e+19 8.21e+17    26.9 <2e-16 ***
## Residuals   3207 9.8e+19 3.06e+16                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Overall, the one-way ANOVA gives strong evidence (F = 26.9, p < 2e-16) that mean revenue is not the same across genres. Revenue appears to depend on genre.

Check whether mean revenue differs across companies

##               Df   Sum Sq  Mean Sq F value Pr(>F)    
## company        5 4.80e+18 9.59e+17    28.8 <2e-16 ***
## Residuals   3219 1.07e+20 3.33e+16                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = revenue ~ company, data = movie)
## 
## $company
##                                            diff       lwr       upr p adj
## Paramount Pictures-Others              67454527  3.24e+07 102485737 0.000
## Sony Pictures-Others                   49800557  1.60e+07  83606812 0.000
## Universal Pictures-Others              89325542  5.82e+07 120413679 0.000
## Walt Disney-Others                     89174826  6.25e+07 115824795 0.000
## Warner Bros-Others                     29868762 -7.35e+06  67084432 0.199
## Sony Pictures-Paramount Pictures      -17653970 -6.28e+07  27502185 0.875
## Universal Pictures-Paramount Pictures  21871015 -2.13e+07  65029881 0.699
## Walt Disney-Paramount Pictures         21720299 -1.84e+07  61800672 0.635
## Warner Bros-Paramount Pictures        -37585764 -8.53e+07  10176371 0.218
## Universal Pictures-Sony Pictures       39524985 -2.65e+06  81695650 0.081
## Walt Disney-Sony Pictures              39374270  3.60e+05  78388543 0.046
## Warner Bros-Sony Pictures             -19931794 -6.68e+07  26939292 0.831
## Walt Disney-Universal Pictures          -150716 -3.68e+07  36533379 1.000
## Warner Bros-Universal Pictures        -59456780 -1.04e+08 -14506718 0.002
## Warner Bros-Walt Disney               -59306064 -1.01e+08 -17303008 0.001

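Overall, there is strong evidence (F = 28.8, p < 2e-16) that mean revenue is not the same across companies. Revenue appears to depend on the company. In the Tukey comparisons, Paramount Pictures, Sony Pictures, Universal Pictures and Walt Disney all differ significantly from Others, while Warner Bros does not (adjusted p = 0.199).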

##               Df   Sum Sq  Mean Sq F value  Pr(>F)    
## season         3 1.59e+18 5.31e+17    15.5 5.3e-10 ***
## Residuals   3221 1.10e+20 3.43e+16                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = revenue ~ season, data = movie)
## 
## $season
##                    diff       lwr       upr p adj
## Spring-Fall    46272179  22500396  70043962 0.000
## Summer-Fall    49081296  26409934  71752657 0.000
## Winter-Fall     8412966 -14905901  31731832 0.790
## Summer-Spring   2809117 -21525012  27143245 0.991
## Winter-Spring -37859214 -62797712 -12920716 0.001
## Winter-Summer -40668330 -64560204 -16776456 0.000

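The Tukey comparisons suggest two groups. Winter and Fall do not differ significantly from each other (adjusted p = 0.790), and neither do Spring and Summer (adjusted p = 0.991), while every pair that crosses the two groups differs (adjusted p of 0.001 or less).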
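The ANOVA tables and Tukey comparisons above can be reproduced in outline with `aov()` and `TukeyHSD()` from base R. This is a sketch on simulated data, not the movie data: the data frame `toy`, its means and its standard deviation are assumptions chosen so that Spring and Summer have higher revenue than Fall and Winter.

```r
set.seed(1)

# Simulated revenue for four seasons; Spring and Summer get a higher mean
toy <- data.frame(
  season  = factor(rep(c("Fall", "Spring", "Summer", "Winter"), each = 50)),
  revenue = c(rnorm(50, 100, 20), rnorm(50, 150, 20),
              rnorm(50, 150, 20), rnorm(50, 100, 20))
)

fit <- aov(revenue ~ season, data = toy)
summary(fit)             # F-test of equal mean revenue across seasons

tukey <- TukeyHSD(fit)   # pairwise differences, 95% family-wise level
tukey$season             # columns: diff, lwr, upr, p adj
```

Each row of `tukey$season` is one pairwise comparison; a confidence interval that excludes zero corresponds to an adjusted p-value below 0.05.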

Split train and test sets

Split

Train and test
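The splitting code is not shown in this report. One common way to split is to sample row indices at random, sketched below. The 70% training proportion, the seed and the data frame `toy` are assumptions for illustration; the report does not state the proportion it used, only that the training set has 2176 of the 3225 rows.

```r
set.seed(42)
toy <- data.frame(id = 1:30, revenue = rnorm(30))

# Draw 70% of the row indices at random for the training set (assumed share)
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```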

Linear regression model

The model

Construct the model on the train set (using all numerical variables)

## 
## Call:
## lm(formula = revenue ~ ., data = train1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.21e+08 -3.89e+07 -1.92e+06  2.46e+07  1.60e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 122161921    2168480   56.34  < 2e-16 ***
## budget       82502895    2822244   29.23  < 2e-16 ***
## popularity   14588304    2952684    4.94  8.4e-07 ***
## runtime      -1265467    2415512   -0.52     0.60    
## score          212212    2648862    0.08     0.94    
## vote         85807055    3723449   23.05  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01e+08 on 2170 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.711 
## F-statistic: 1.07e+03 on 5 and 2170 DF,  p-value: <2e-16
##     budget popularity    runtime      score       vote 
##       1.68       2.15       1.26       1.50       2.95

Test

##      mae      mse     rmse     mape 
## 6.25e+07 1.12e+16 1.06e+08 1.06e+04
##      mae      mse     rmse     mape 
## 5.86e+07 1.02e+16 1.01e+08 4.76e+03
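The four error measures reported above can be computed with a small helper such as the one below. The function name `error_metrics` is hypothetical, and expressing MAPE as a percentage is an assumption; the report does not show its own definition.

```r
# Hypothetical helper: MAE, MSE, RMSE and MAPE (in percent) for a set of predictions
error_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(mae  = mean(abs(err)),
    mse  = mean(err^2),
    rmse = sqrt(mean(err^2)),
    mape = mean(abs(err / actual)) * 100)
}

# Invented values; the third actual value is close to zero
actual    <- c(100, 200, 5)
predicted <- c(110, 180, 50)

m <- error_metrics(actual, predicted)
m
```

MAPE divides each error by the actual value, so a few observations with revenue close to zero can dominate it. The data summary shows a minimum revenue of 5, which may explain why the MAPE values are so large compared with the other measures.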

Feature selection

All three feature selection methods show that predictors (budget + popularity + vote) form the best model.
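The report does not name the three selection methods. As one plausible illustration, backward stepwise selection by AIC is available in base R through `step()`. The simulated data below are an assumption: `revenue` is built from `budget` and `vote` only, so `runtime` is pure noise.

```r
set.seed(7)
n <- 200

# Simulated predictors; only budget and vote drive revenue
toy <- data.frame(budget = rnorm(n), vote = rnorm(n), runtime = rnorm(n))
toy$revenue <- 3 * toy$budget + 2 * toy$vote + rnorm(n)

full <- lm(revenue ~ ., data = toy)

# Drop terms one at a time while doing so lowers the AIC
best <- step(full, direction = "backward", trace = 0)
formula(best)
```

Other selection approaches (forward selection, best subsets) follow the same pattern of comparing candidate models by a criterion such as AIC, BIC or adjusted R-squared.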

Best Model

Construct model on train set

## 
## Call:
## lm(formula = revenue ~ budget + popularity + vote, data = train1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.21e+08 -3.85e+07 -2.19e+06  2.44e+07  1.60e+09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.22e+08   2.17e+06   56.36  < 2e-16 ***
## budget      8.23e+07   2.62e+06   31.45  < 2e-16 ***
## popularity  1.46e+07   2.95e+06    4.96  7.8e-07 ***
## vote        8.57e+07   3.47e+06   24.72  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01e+08 on 2172 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.711 
## F-statistic: 1.78e+03 on 3 and 2172 DF,  p-value: <2e-16
##     budget popularity       vote 
##       1.45       2.15       2.55

Predict on the test set

##      mae      mse     rmse     mape 
## 6.25e+07 1.12e+16 1.06e+08 1.06e+04
##      mae      mse     rmse     mape 
## 5.86e+07 1.02e+16 1.01e+08 4.76e+03

Model with categorical variables (genres, company and season)

Feature Selection

It seems that the seasons do not differ in their effect on revenue once the other predictors are in the model.

## 
## Call:
## lm(formula = revenue ~ ., data = train1_full)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.08e+08 -4.03e+07 -1.43e+06  2.90e+07  1.62e+09 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                98181070    6527080   15.04  < 2e-16 ***
## budget                     77348867    3033574   25.50  < 2e-16 ***
## popularity                 13617667    2932282    4.64  3.6e-06 ***
## runtime                     5900196    2594312    2.27  0.02305 *  
## score                      -1602884    2751829   -0.58  0.56030    
## vote                       87028064    3716646   23.42  < 2e-16 ***
## genresAdventure            14998048    8669568    1.73  0.08378 .  
## genresAnimation            87790816   13236699    6.63  4.2e-11 ***
## genresComedy               25268186    7163941    3.53  0.00043 ***
## genresCrime               -10102172   11467478   -0.88  0.37845    
## genresDocumentary          44711638   23188936    1.93  0.05397 .  
## genresDrama                 4923647    7294556    0.67  0.49976    
## genresFamily               82467003   20391297    4.04  5.4e-05 ***
## genresFantasy               2870822   13660650    0.21  0.83357    
## genresHistory              15689738   24998838    0.63  0.53032    
## genresHorror               17619142   10531825    1.67  0.09448 .  
## genresMusic                23079584   29354497    0.79  0.43182    
## genresMystery               6103862   24024211    0.25  0.79946    
## genresRomance              17057626   14926772    1.14  0.25327    
## genresScience Fiction     -15440346   15106799   -1.02  0.30686    
## genresThriller             -7679779   12337527   -0.62  0.53370    
## genresWar                 -56563625   33719837   -1.68  0.09360 .  
## genresWestern              -3456414   24929422   -0.14  0.88974    
## companyParamount Pictures  19120256    8303147    2.30  0.02139 *  
## companySony Pictures        4555821    7970621    0.57  0.56767    
## companyUniversal Pictures  16743343    7455117    2.25  0.02481 *  
## companyWalt Disney         16706288    6373478    2.62  0.00882 ** 
## companyWarner Bros          -535044    8559594   -0.06  0.95016    
## seasonSpring                9929435    6117507    1.62  0.10471    
## seasonSummer                9106694    5938990    1.53  0.12533    
## seasonWinter                4037639    5967395    0.68  0.49872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99300000 on 2145 degrees of freedom
## Multiple R-squared:  0.725,  Adjusted R-squared:  0.721 
## F-statistic:  188 on 30 and 2145 DF,  p-value: <2e-16
##                    budget                popularity 
##                      2.01                      2.20 
##                   runtime                     score 
##                      1.51                      1.68 
##                      vote           genresAdventure 
##                      3.04                      1.40 
##           genresAnimation              genresComedy 
##                      1.25                      1.79 
##               genresCrime         genresDocumentary 
##                      1.26                      1.08 
##               genresDrama              genresFamily 
##                      2.06                      1.08 
##             genresFantasy             genresHistory 
##                      1.14                      1.07 
##              genresHorror               genresMusic 
##                      1.32                      1.04 
##             genresMystery             genresRomance 
##                      1.04                      1.13 
##     genresScience Fiction            genresThriller 
##                      1.11                      1.18 
##                 genresWar             genresWestern 
##                      1.03                      1.06 
## companyParamount Pictures      companySony Pictures 
##                      1.09                      1.11 
## companyUniversal Pictures        companyWalt Disney 
##                      1.12                      1.18 
##        companyWarner Bros              seasonSpring 
##                      1.08                      1.44 
##              seasonSummer              seasonWinter 
##                      1.48                      1.44

None of the season coefficients is significant at the 5% level (p-values of 0.10, 0.13 and 0.50), so season does not seem to be a necessary predictor.

Feature Selection

When season, genre and company are included in the model, the best numerical predictors are still budget, popularity and vote. We will build the model with these 3 predictors and the 2 categorical variables genre and company.

Model with Genre & Company

## 
## Call:
## lm(formula = revenue ~ budget + vote + company + genres + popularity, 
##     data = train1_full)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.14e+08 -4.04e+07 -8.25e+05  2.96e+07  1.62e+09 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               103783104    5493099   18.89  < 2e-16 ***
## budget                     79307464    2810718   28.22  < 2e-16 ***
## vote                       87343750    3432074   25.45  < 2e-16 ***
## companyParamount Pictures  20187352    8296723    2.43  0.01505 *  
## companySony Pictures        4706890    7968099    0.59  0.55477    
## companyUniversal Pictures  17875529    7436574    2.40  0.01631 *  
## companyWalt Disney         16364447    6369330    2.57  0.01026 *  
## companyWarner Bros          -135185    8558273   -0.02  0.98740    
## genresAdventure            14924242    8645045    1.73  0.08443 .  
## genresAnimation            80203108   12811282    6.26  4.6e-10 ***
## genresComedy               24197175    7152449    3.38  0.00073 ***
## genresCrime                -8821228   11302914   -0.78  0.43522    
## genresDocumentary          40500854   22990580    1.76  0.07827 .  
## genresDrama                 6560315    6935298    0.95  0.34429    
## genresFamily               78112299   20296238    3.85  0.00012 ***
## genresFantasy               1516100   13618042    0.11  0.91136    
## genresHistory              22335236   24701813    0.90  0.36599    
## genresHorror               15897705   10491528    1.52  0.12985    
## genresMusic                22199706   29264598    0.76  0.44818    
## genresMystery               3544411   24017226    0.15  0.88269    
## genresRomance              16669177   14880624    1.12  0.26276    
## genresScience Fiction     -15795028   15101070   -1.05  0.29570    
## genresThriller             -7928898   12337416   -0.64  0.52051    
## genresWar                 -55041648   33646312   -1.64  0.10201    
## genresWestern                726641   24731085    0.03  0.97656    
## popularity                 13447493    2930278    4.59  4.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99400000 on 2150 degrees of freedom
## Multiple R-squared:  0.724,  Adjusted R-squared:  0.721 
## F-statistic:  225 on 25 and 2150 DF,  p-value: <2e-16
##                    budget                      vote 
##                      1.73                      2.59 
## companyParamount Pictures      companySony Pictures 
##                      1.09                      1.11 
## companyUniversal Pictures        companyWalt Disney 
##                      1.11                      1.17 
##        companyWarner Bros           genresAdventure 
##                      1.08                      1.39 
##           genresAnimation              genresComedy 
##                      1.17                      1.78 
##               genresCrime         genresDocumentary 
##                      1.22                      1.06 
##               genresDrama              genresFamily 
##                      1.86                      1.07 
##             genresFantasy             genresHistory 
##                      1.13                      1.04 
##              genresHorror               genresMusic 
##                      1.30                      1.03 
##             genresMystery             genresRomance 
##                      1.04                      1.12 
##     genresScience Fiction            genresThriller 
##                      1.11                      1.17 
##                 genresWar             genresWestern 
##                      1.03                      1.04 
##                popularity 
##                      2.19

The adjusted R-squared increases by about one percentage point (from 0.711 to 0.721) compared to the best model with numerical variables only.

Prediction

##      mae      mse     rmse     mape 
## 6.19e+07 1.09e+16 1.05e+08 7.33e+03
##      mae      mse     rmse     mape 
## 5.83e+07 9.76e+15 9.88e+07 8.55e+03
## [1] 86399
## [1] 86439
## [1] "------"
## [1] 86395
## [1] 86424
## [1] "-------"
## [1] 86343
## [1] 86497

PCR

PCR

Check variance

## Importance of components:
##                             PC1      PC2 PC3  PC4  PC5   PC6
## Standard deviation     1.89e+08 3.10e+07 926 23.9 20.1 0.699
## Proportion of Variance 9.74e-01 2.62e-02   0  0.0  0.0 0.000
## Cumulative Proportion  9.74e-01 1.00e+00   1  1.0  1.0 1.000
## Importance of components:
##                          PC1   PC2   PC3    PC4    PC5    PC6
## Standard deviation     1.768 1.108 0.894 0.6466 0.5045 0.4186
## Proportion of Variance 0.521 0.204 0.133 0.0697 0.0424 0.0292
## Cumulative Proportion  0.521 0.725 0.859 0.9284 0.9708 1.0000
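The two tables above show why scaling matters for principal components: without scaling, the variable with the largest variance dominates PC1 (97.4% here), while with scaling the variance is spread over several components. A minimal sketch with `prcomp()` on simulated data follows; the data frame `toy` and its distribution parameters are invented for illustration.

```r
set.seed(3)
n <- 100

# Simulated variables on very different scales (parameters are invented)
toy <- data.frame(
  budget  = rnorm(n, 4e7, 4e7),
  runtime = rnorm(n, 110, 20),
  score   = rnorm(n, 6.3, 0.9)
)

pca_raw    <- prcomp(toy, scale. = FALSE)  # covariance-based PCA
pca_scaled <- prcomp(toy, scale. = TRUE)   # correlation-based PCA

# Proportion of variance explained by each component
prop_raw    <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)

round(prop_raw, 3)
round(prop_scaled, 3)
```

`summary(pca_scaled)` prints the same information in the "Importance of components" layout used above.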

Scale and rotate data

Train

## Data:    X dimension: 2176 5 
##  Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)    1 comps    2 comps    3 comps    4 comps    5 comps
## CV        1.88e+08  122329562  106688281  106237668  103730433  102073627
## adjCV     1.88e+08  122230125  106520321  106183882  103659547  102017104
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps
## X          48.50    71.69    87.41    95.61   100.00
## revenue    58.21    68.09    68.67    70.49    71.13
## Data:    X dimension: 2176 5 
##  Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)    1 comps    2 comps    3 comps    4 comps    5 comps
## CV        1.88e+08  122602547  109625493  107879872  103007610  103518646
## adjCV     1.88e+08  122517816  109502905  107831225  102938948  103373422
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps
## X          49.00    71.71    87.31    95.63   100.00
## revenue    57.87    66.61    67.94    70.47    71.13

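  • With 2 components, the PCR fits explain about 72% of the variance in the predictors and 67 to 68% of the variance in revenue. Four components are needed to explain more than 90% of the predictor variance.

  • The scaled data explain slightly more of the variance in revenue than the non-scaled data for 1 to 4 components (for example, 68.09% versus 66.61% with 2 components).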

Let’s try the pcr model on test data

The variance explained in revenue rises markedly from 1 to 2 components (from about 58% to 67 to 68%), and adding further components changes it only slightly (71.13% with all 5). We can say that 2 principal components capture most of the variance that the predictors can explain.

Try linear model with predictors as PCs:

## 
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.11e+09 -4.03e+07 -4.95e+06  2.42e+07  1.73e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 120012282    2277573    52.7   <2e-16 ***
## PC1         -92108152    1462914   -63.0   <2e-16 ***
## PC2          54901946    2115458    25.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.06e+08 on 2173 degrees of freedom
## Multiple R-squared:  0.681,  Adjusted R-squared:  0.681 
## F-statistic: 2.32e+03 on 2 and 2173 DF,  p-value: <2e-16
## PC1 PC2 
##   1   1
## 
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr.nc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.20e+09 -4.08e+07 -5.95e+06  2.28e+07  1.76e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 121493614    2330120    52.1   <2e-16 ***
## PC1         -89813816    1463518   -61.4   <2e-16 ***
## PC2          51258272    2149532    23.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09e+08 on 2173 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.666 
## F-statistic: 2.17e+03 on 2 and 2173 DF,  p-value: <2e-16
## PC1 PC2 
##   1   1
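
  • The scaled version gives better results than the non-scaled one (adjusted R-squared of 0.681 versus 0.666, residual standard error of 1.06e+08 versus 1.09e+08).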
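A linear model on principal components can be sketched by hand with `prcomp()` and `lm()`. The simulated data, the coefficients and the name `toy_pcr` are assumptions for illustration. Because principal component scores are uncorrelated by construction, the variance inflation factors of PC1 and PC2 are exactly 1, which matches the `PC1 PC2 / 1 1` lines printed after each model summary above.

```r
set.seed(11)
n <- 150

# Simulated predictors, two of them correlated
x <- data.frame(budget = rnorm(n), vote = rnorm(n), popularity = rnorm(n))
x$vote <- x$vote + 0.8 * x$budget
revenue <- 2 * x$budget + 1.5 * x$vote + rnorm(n)

# Rotate the scaled predictors and keep the first two component scores
pca     <- prcomp(x, center = TRUE, scale. = TRUE)
scores  <- as.data.frame(pca$x[, 1:2])
toy_pcr <- cbind(revenue = revenue, scores)

fit <- lm(revenue ~ PC1 + PC2, data = toy_pcr)
summary(fit)$r.squared

# The scores are orthogonal, so there is no collinearity between PC1 and PC2
cor(toy_pcr$PC1, toy_pcr$PC2)
```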

Validation

## [1] 86611
## [1] 86633
## [1] 86710
## [1] 86732

  • Both AIC (86611 versus 86710) and BIC (86633 versus 86732) agree that the scaled version is better.

Logit

Prior Chi-squared test

## 
##  Pearson's Chi-squared test
## 
## data:  contable1
## X-squared = 54, df = 17, p-value = 1e-05
## 
##  Pearson's Chi-squared test
## 
## data:  contable2
## X-squared = 93, df = 5, p-value <2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  contable3
## X-squared = 20, df = 3, p-value = 1e-04

  • All three p-values are well below 0.05, so there is evidence that profitable depends on genres, company and season.
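A chi-squared test of independence takes a contingency table of counts. The sketch below uses invented counts for season against profitable; on the real data, a table built with `table(movie$season, movie$profitable)` would presumably play the role of `contable3`, but that construction is an assumption because the code is not shown.

```r
# Invented counts: rows are seasons, columns are profitable (0 = no, 1 = yes)
contable <- matrix(
  c(60, 140,
    40, 160,
    30, 170,
    55, 145),
  nrow = 4, byrow = TRUE,
  dimnames = list(season = c("Fall", "Spring", "Summer", "Winter"),
                  profitable = c("0", "1"))
)

res <- chisq.test(contable)
res$parameter  # degrees of freedom: (4 - 1) * (2 - 1) = 3
res$p.value    # a small p-value is evidence against independence
```

The degrees of freedom explain the values in the output above: 17 for genres (18 levels), 5 for company (6 levels) and 3 for season (4 levels), each against a two-level outcome.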

In this part we will construct the logit model on the whole dataset.

## 'data.frame':    3225 obs. of  10 variables:
##  $ revenue   : num  2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
##  $ budget    : int  237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
##  $ popularity: num  150.4 139.1 107.4 112.3 43.9 ...
##  $ runtime   : num  162 169 148 165 132 139 100 141 153 151 ...
##  $ score     : num  7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
##  $ vote      : int  11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
##  $ genres    : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
##  $ company   : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
##  $ season    : Factor w/ 4 levels "Fall","Spring",..: 4 2 1 3 2 2 1 2 3 2 ...
##  $ y         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Budget and revenue alone are enough to determine profitable, because profit is calculated as revenue minus budget. Our preliminary test with “bestglm” shows the same result.

##   revenue budget popularity runtime score  vote genres company season
## 1    TRUE   TRUE      FALSE   FALSE FALSE FALSE  FALSE   FALSE  FALSE
## 2    TRUE   TRUE      FALSE   FALSE FALSE FALSE  FALSE   FALSE   TRUE
## 3    TRUE   TRUE      FALSE    TRUE FALSE FALSE  FALSE   FALSE  FALSE
## 4    TRUE   TRUE       TRUE   FALSE FALSE FALSE  FALSE   FALSE   TRUE
## 5    TRUE   TRUE      FALSE   FALSE FALSE  TRUE  FALSE   FALSE   TRUE
##   Criterion
## 1      17.4
## 2      17.6
## 3      18.1
## 4      18.3
## 5      18.5

However, since revenue and budget together determine profitable directly, we had better not use them together.

In practice, we prefer budget to revenue for predicting profit. A film manager would want a prediction of a movie's profit before its main release date. The information available at that point is the budget, runtime, genres, production company, popularity, vote and score (vote and score can be obtained from a preview screening, and popularity can be measured after advertisements, trailers and leaks from the movie).

Let’s try the model with budget and the other predictors, without revenue.

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = movie_nd[-c(1)])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.467   0.000   0.293   0.728   1.866  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -2.31e+00   4.46e-01   -5.19  2.1e-07 ***
## budget                    -1.69e-08   2.03e-09   -8.34  < 2e-16 ***
## popularity                 2.21e-02   9.82e-03    2.25  0.02448 *  
## runtime                    7.95e-04   2.75e-03    0.29  0.77253    
## score                      2.51e-01   7.05e-02    3.57  0.00036 ***
## vote                       2.43e-03   3.38e-04    7.19  6.7e-13 ***
## genresAdventure            3.44e-02   2.10e-01    0.16  0.86979    
## genresAnimation            4.27e-01   3.49e-01    1.22  0.22116    
## genresComedy               4.48e-01   1.56e-01    2.88  0.00398 ** 
## genresCrime                4.70e-02   2.50e-01    0.19  0.85105    
## genresDocumentary          6.89e-01   4.42e-01    1.56  0.11943    
## genresDrama                4.30e-02   1.55e-01    0.28  0.78137    
## genresFamily               4.42e-01   4.59e-01    0.96  0.33546    
## genresFantasy              3.43e-01   3.58e-01    0.96  0.33770    
## genresHistory              3.96e-01   6.37e-01    0.62  0.53387    
## genresHorror               1.01e+00   2.66e-01    3.80  0.00015 ***
## genresMusic                4.91e-01   5.58e-01    0.88  0.37926    
## genresMystery             -1.80e-02   5.23e-01   -0.03  0.97260    
## genresRomance              6.07e-01   3.53e-01    1.72  0.08515 .  
## genresScience Fiction     -7.18e-02   3.82e-01   -0.19  0.85103    
## genresThriller             1.78e-02   2.76e-01    0.06  0.94866    
## genresWar                 -1.27e+00   6.10e-01   -2.08  0.03707 *  
## genresWestern              2.04e+00   8.44e-01    2.42  0.01548 *  
## companyParamount Pictures  9.52e-01   1.97e-01    4.83  1.4e-06 ***
## companySony Pictures       7.01e-01   1.82e-01    3.86  0.00011 ***
## companyUniversal Pictures  8.18e-01   1.91e-01    4.29  1.8e-05 ***
## companyWalt Disney         8.96e-01   1.52e-01    5.90  3.6e-09 ***
## companyWarner Bros         6.51e-01   2.01e-01    3.23  0.00123 ** 
## seasonSpring               2.71e-01   1.35e-01    2.01  0.04415 *  
## seasonSummer               4.75e-01   1.32e-01    3.58  0.00034 ***
## seasonWinter               2.85e-01   1.29e-01    2.21  0.02743 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3584.1  on 3224  degrees of freedom
## Residual deviance: 2656.1  on 3194  degrees of freedom
## AIC: 2718
## 
## Number of Fisher Scoring iterations: 8
##               (Intercept)                    budget 
##                    0.0989                    1.0000 
##                popularity                   runtime 
##                    1.0223                    1.0008 
##                     score                      vote 
##                    1.2858                    1.0024 
##           genresAdventure           genresAnimation 
##                    1.0350                    1.5325 
##              genresComedy               genresCrime 
##                    1.5653                    1.0481 
##         genresDocumentary               genresDrama 
##                    1.9909                    1.0439 
##              genresFamily             genresFantasy 
##                    1.5562                    1.4095 
##             genresHistory              genresHorror 
##                    1.4861                    2.7467 
##               genresMusic             genresMystery 
##                    1.6340                    0.9822 
##             genresRomance     genresScience Fiction 
##                    1.8350                    0.9307 
##            genresThriller                 genresWar 
##                    1.0179                    0.2806 
##             genresWestern companyParamount Pictures 
##                    7.7111                    2.5912 
##      companySony Pictures companyUniversal Pictures 
##                    2.0167                    2.2661 
##        companyWalt Disney        companyWarner Bros 
##                    2.4501                    1.9179 
##              seasonSpring              seasonSummer 
##                    1.3110                    1.6080 
##              seasonWinter 
##                    1.3304
##   budget popularity runtime score vote genres company season Criterion
## 1   TRUE       TRUE   FALSE  TRUE TRUE   TRUE    TRUE   TRUE      2714
## 2   TRUE       TRUE    TRUE  TRUE TRUE   TRUE    TRUE   TRUE      2716
## 3   TRUE      FALSE   FALSE  TRUE TRUE   TRUE    TRUE   TRUE      2717
## 4   TRUE      FALSE    TRUE  TRUE TRUE   TRUE    TRUE   TRUE      2719
## 5   TRUE       TRUE   FALSE  TRUE TRUE   TRUE    TRUE  FALSE      2722
  • With budget included and revenue excluded, the best model (lowest criterion, 2714) uses every predictor except runtime.
  • In the other case (revenue included, budget excluded), the best model uses every predictor except company and season.
  • The model including revenue has a clearly lower criterion (2070 versus 2714), so it fits the data better.
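
The selection step above can be sketched as an exhaustive search over predictor subsets, ranked by an information criterion. This is a minimal illustration on simulated data, not the code used for the report: the variables `x1`, `x2`, `x3` are hypothetical, and AIC is assumed as the criterion because the report does not name the one it used.

```r
# Simulated stand-in data; column names are hypothetical.
set.seed(42)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.3 + 1.0 * d$x1 + 0.5 * d$x2))  # x3 is pure noise

predictors <- c("x1", "x2", "x3")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one logistic regression per subset and record its AIC.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- glm(reformulate(vars, response = "y"), family = binomial, data = d)
  data.frame(model = paste(vars, collapse = " + "), aic = AIC(fit))
}))

# Lowest criterion first, mirroring the layout of the table above.
results <- results[order(results$aic), ]
print(head(results, 5))
```

With eight candidate predictors, as in the report, the same loop would fit 255 models, which is still cheap for a data set of this size.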

Build model without runtime

## 
## Call:
## glm(formula = y ~ budget + popularity + score + vote + genres + 
##     company + season, family = "binomial", data = movie_nd[-c(1)])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.462   0.000   0.294   0.729   1.880  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -2.27e+00   4.21e-01   -5.39  6.9e-08 ***
## budget                    -1.68e-08   1.96e-09   -8.55  < 2e-16 ***
## popularity                 2.22e-02   9.80e-03    2.26  0.02376 *  
## score                      2.58e-01   6.66e-02    3.88  0.00011 ***
## vote                       2.42e-03   3.37e-04    7.19  6.3e-13 ***
## genresAdventure            3.26e-02   2.10e-01    0.16  0.87638    
## genresAnimation            4.08e-01   3.43e-01    1.19  0.23407    
## genresComedy               4.45e-01   1.55e-01    2.87  0.00415 ** 
## genresCrime                5.04e-02   2.50e-01    0.20  0.83998    
## genresDocumentary          6.78e-01   4.41e-01    1.54  0.12380    
## genresDrama                4.93e-02   1.53e-01    0.32  0.74762    
## genresFamily               4.30e-01   4.57e-01    0.94  0.34700    
## genresFantasy              3.40e-01   3.58e-01    0.95  0.34133    
## genresHistory              4.20e-01   6.32e-01    0.66  0.50651    
## genresHorror               1.01e+00   2.66e-01    3.79  0.00015 ***
## genresMusic                4.88e-01   5.58e-01    0.87  0.38256    
## genresMystery             -2.19e-02   5.23e-01   -0.04  0.96663    
## genresRomance              6.05e-01   3.53e-01    1.72  0.08601 .  
## genresScience Fiction     -7.18e-02   3.82e-01   -0.19  0.85105    
## genresThriller             1.97e-02   2.76e-01    0.07  0.94315    
## genresWar                 -1.26e+00   6.09e-01   -2.07  0.03823 *  
## genresWestern              2.05e+00   8.44e-01    2.43  0.01524 *  
## companyParamount Pictures  9.51e-01   1.97e-01    4.82  1.4e-06 ***
## companySony Pictures       7.01e-01   1.82e-01    3.86  0.00012 ***
## companyUniversal Pictures  8.19e-01   1.91e-01    4.29  1.8e-05 ***
## companyWalt Disney         8.94e-01   1.52e-01    5.90  3.7e-09 ***
## companyWarner Bros         6.52e-01   2.01e-01    3.24  0.00121 ** 
## seasonSpring               2.71e-01   1.35e-01    2.01  0.04422 *  
## seasonSummer               4.75e-01   1.32e-01    3.58  0.00034 ***
## seasonWinter               2.86e-01   1.29e-01    2.21  0.02711 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3584.1  on 3224  degrees of freedom
## Residual deviance: 2656.2  on 3195  degrees of freedom
## AIC: 2716
## 
## Number of Fisher Scoring iterations: 8
## $`companyParamount Pictures`
## [1] 2.59
## 
## $`companySony Pictures`
## [1] 2.01
## 
## $`companyUniversal Pictures`
## [1] 2.27
## 
## $`companyWalt Disney`
## [1] 2.45
## 
## $`companyWarner Bros`
## [1] 1.92
## $seasonSpring
## [1] 1.31
## 
## $seasonSummer
## [1] 1.61
## 
## $seasonWinter
## [1] 1.33

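Check the effect of company, genres and season

The three Wald tests below are for company (df = 5), genres (df = 17) and season (df = 3), in that order.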

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 68.1, df = 5, P(> X2) = 2.6e-13
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 39.2, df = 17, P(> X2) = 0.0017
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 13.6, df = 3, P(> X2) = 0.0036
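
As a quick check, the p-values printed above can be recomputed from each statistic and its degrees of freedom with the chi-squared upper tail. The inputs below are the rounded values from the output, so the results agree only to about two significant digits.

```r
# Rounded Wald statistics and degrees of freedom copied from the output above.
x2 <- c(68.1, 39.2, 13.6)
df <- c(5, 17, 3)

# Upper-tail probability of the chi-squared distribution.
p <- pchisq(x2, df = df, lower.tail = FALSE)
print(signif(p, 2))
```

All three p-values are below 0.01, so each factor as a whole contributes to the model even though many individual levels are not significant.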

We can validate the model with some methods:

Hosmer and Lemeshow test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  movie_nd[-c(1)]$y, fitted(prf_glm)
## X-squared = 3225, df = 8, p-value <2e-16
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  movie_nd[-c(1)]$y, fitted(prf_glm0)
## X-squared = 3225, df = 8, p-value <2e-16
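  • The p-value is very low for both models. In the Hosmer and Lemeshow test the null hypothesis is that the model fits well, so a low p-value indicates lack of fit, not a good fit.
  • However, the statistic is identical for both models and equal to the number of observations (3225). This suggests the test may have been run on a mis-coded response (for example a factor instead of numeric 0/1), so the result should be rechecked before drawing any conclusion.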
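
To make the interpretation concrete, here is a minimal hand computation of the Hosmer and Lemeshow statistic in base R. It is a sketch on simulated data with a numeric 0/1 response, not a re-run of the report's models; the variable names are hypothetical.

```r
# Simulated data: the fitted model is correctly specified by construction.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))   # numeric 0/1 response
fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Split the observations into g groups by deciles of the fitted probability.
g <- 10
breaks <- quantile(p_hat, probs = seq(0, 1, length.out = g + 1))
grp <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

# Observed and expected counts of ones and zeros per group.
n_g  <- tapply(y, grp, length)
obs1 <- tapply(y, grp, sum)
exp1 <- tapply(p_hat, grp, sum)
obs0 <- n_g - obs1
exp0 <- n_g - exp1

# Chi-squared statistic with g - 2 degrees of freedom.
hl_stat <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
hl_p    <- pchisq(hl_stat, df = g - 2, lower.tail = FALSE)
print(c(statistic = hl_stat, p_value = hl_p))
```

A large p-value here means no evidence against the fit; a small one means the observed and expected counts disagree.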

ROC curve and AUC:

  • We can compute this on both the training and test sets.
## Area under the curve: 0.841

## Area under the curve: 0.841

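  • The area under the curve is 0.841 in both cases, above 0.80, which indicates good discrimination between profitable and unprofitable movies. AUC measures discrimination rather than calibration, so it does not by itself confirm the Hosmer and Lemeshow result.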
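
For reference, AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal base R sketch using the rank (Mann-Whitney) formula follows; `auc_rank` is a hypothetical helper written for this illustration, not a function from the report.

```r
# AUC from ranks: (sum of ranks of positives - n1 (n1 + 1) / 2) / (n1 * n0).
auc_rank <- function(score, label) {
  stopifnot(length(score) == length(label), all(label %in% c(0, 1)))
  r  <- rank(score)              # average ranks, so ties count as one half
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny example: 3 of the 4 positive/negative pairs are ordered correctly.
score <- c(0.1, 0.4, 0.35, 0.8)
label <- c(0, 0, 1, 1)
print(auc_rank(score, label))    # 0.75
```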

McFadden

##       llh   llhNull        G2  McFadden      r2ML      r2CU 
## -1328.072 -1792.074   928.005     0.259     0.250     0.373
##       llh   llhNull        G2  McFadden      r2ML      r2CU 
## -1328.114 -1792.074   927.921     0.259     0.250     0.373
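  • McFadden's pseudo-R² is 0.259 for both models: the fitted log-likelihood is about 25.9% closer to zero than that of the null model. This is not literally the share of variance in y explained, but values between 0.2 and 0.4 are often regarded as a good fit for a logistic model.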
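
The three pseudo-R² values in the output can be reproduced from the two log-likelihoods alone. The sketch below uses the rounded figures printed above together with the sample size of 3225; `pseudo_r2` is a hypothetical helper name.

```r
# Pseudo-R2 measures from the fitted and null log-likelihoods.
pseudo_r2 <- function(llh, llh_null, n) {
  g2   <- -2 * (llh_null - llh)                   # likelihood-ratio statistic
  mcf  <- 1 - llh / llh_null                      # McFadden
  r2ml <- 1 - exp(-g2 / n)                        # maximum likelihood (Cox and Snell)
  r2cu <- r2ml / (1 - exp(2 * llh_null / n))      # Cragg and Uhler (Nagelkerke)
  c(G2 = g2, McFadden = mcf, r2ML = r2ml, r2CU = r2cu)
}

# Values printed above for the first model.
r2 <- pseudo_r2(llh = -1328.072, llh_null = -1792.074, n = 3225)
print(round(r2, 3))
```

McFadden's measure compares log-likelihoods, which is why it should not be read as a proportion of variance.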

Let’s test with revenue, assuming that the movie has already been released in one particular region and that we use the revenue data obtained there to predict whether the movie will earn a profit.

Let’s see which is the best model with revenue included and budget excluded

##   revenue popularity runtime score  vote genres company season Criterion
## 1    TRUE       TRUE    TRUE  TRUE  TRUE   TRUE   FALSE  FALSE      2070
## 2    TRUE       TRUE    TRUE  TRUE FALSE   TRUE   FALSE  FALSE      2070
## 3    TRUE      FALSE    TRUE  TRUE FALSE   TRUE   FALSE  FALSE      2071
## 4    TRUE      FALSE    TRUE  TRUE  TRUE   TRUE   FALSE  FALSE      2072
## 5    TRUE       TRUE    TRUE  TRUE  TRUE   TRUE    TRUE  FALSE      2073

In this model, we remove budget and, following the model selection above, also leave out company and season.

## 
## Call:
## glm(formula = y ~ revenue + popularity + score + vote + genres + 
##     runtime, family = "binomial", data = movie_nd[-c(2)])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.949   0.000   0.050   0.497   2.283  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -4.67e+00   5.11e-01   -9.13  < 2e-16 ***
## revenue                5.43e-08   2.89e-09   18.78  < 2e-16 ***
## popularity            -1.80e-02   8.73e-03   -2.06  0.03959 *  
## score                  9.74e-01   8.32e-02   11.72  < 2e-16 ***
## vote                   3.75e-04   2.68e-04    1.40  0.16175    
## genresAdventure       -6.59e-01   2.65e-01   -2.48  0.01301 *  
## genresAnimation       -1.59e+00   4.98e-01   -3.19  0.00143 ** 
## genresComedy           6.29e-01   1.80e-01    3.49  0.00048 ***
## genresCrime            3.55e-01   2.78e-01    1.28  0.20098    
## genresDocumentary      4.01e-01   4.69e-01    0.85  0.39302    
## genresDrama            3.15e-01   1.77e-01    1.77  0.07626 .  
## genresFamily          -4.24e-01   5.66e-01   -0.75  0.45389    
## genresFantasy          2.30e-01   4.18e-01    0.55  0.58251    
## genresHistory          9.61e-01   7.26e-01    1.32  0.18550    
## genresHorror           1.73e+00   2.86e-01    6.06  1.4e-09 ***
## genresMusic            3.26e-01   5.90e-01    0.55  0.58028    
## genresMystery          2.57e-01   5.86e-01    0.44  0.66115    
## genresRomance          5.81e-01   3.98e-01    1.46  0.14448    
## genresScience Fiction  1.16e-01   4.36e-01    0.27  0.79065    
## genresThriller         5.15e-01   3.15e-01    1.63  0.10243    
## genresWar             -1.38e+00   7.22e-01   -1.90  0.05685 .  
## genresWestern          2.49e+00   8.07e-01    3.08  0.00204 ** 
## runtime               -2.36e-02   3.28e-03   -7.19  6.4e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3584.1  on 3224  degrees of freedom
## Residual deviance: 2026.0  on 3202  degrees of freedom
## AIC: 2072
## 
## Number of Fisher Scoring iterations: 8

This model has a better (lower) AIC than the model above (2072 versus 2716). This is expected, since the correlation plot showed that profit is more strongly related to revenue.

##       llh   llhNull        G2  McFadden      r2ML      r2CU 
## -1013.011 -1792.074  1558.127     0.435     0.383     0.571

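McFadden's pseudo-R² is 0.435, up from 0.259 for the budget model. As a pseudo-R² it measures the relative gain in log-likelihood over the null model rather than the share of variation in y explained.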

## Area under the curve: 0.913

## [1] 2716
## [1] 2899
## [1] "---"
## [1] 2072
## [1] 2212
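
The unlabeled values above are consistent with the AIC and BIC of the budget model (2716, 2899) and of the revenue model (2072, 2212). That reading is an inference from the arithmetic, not something the report states. For an ungrouped binary response the residual deviance equals minus twice the log-likelihood, so both criteria follow from the printed deviance and the number of estimated coefficients:

```r
n <- 3225   # number of observations

# Residual deviance and number of coefficients (n minus residual df),
# both read from the two model summaries.
dev <- c(budget_model = 2656.2, revenue_model = 2026.0)
k   <- c(budget_model = 3225 - 3195, revenue_model = 3225 - 3202)

aic <- dev + 2 * k          # AIC = deviance + 2k
bic <- dev + k * log(n)     # BIC = deviance + k log(n)

print(round(rbind(AIC = aic, BIC = bic)))
```

Both criteria favour the revenue model, and BIC penalises the 30-coefficient budget model more heavily than AIC does.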

Overall, we do not prefer the model using revenue. Revenue is strongly correlated with profit, so it naturally explains the profit status very well, but it only becomes known after the movie is released, so it is of little use for prediction. We should rely on variables that are known before release, such as budget, genres, company, vote, runtime, score, …

Some may argue that revenue from a single region could serve as a sample for the prediction. However, regional revenue is not representative of worldwide revenue.

The Wald chi-squared tests above also show that profit status depends on company and season, yet the revenue model drops both variables. This is a further sign that the revenue model is not a practical choice.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  686  920
##          1  101 1518
##                                         
##                Accuracy : 0.683         
##                  95% CI : (0.667, 0.699)
##     No Information Rate : 0.756         
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.366         
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.872         
##             Specificity : 0.623         
##          Pos Pred Value : 0.427         
##          Neg Pred Value : 0.938         
##              Prevalence : 0.244         
##          Detection Rate : 0.213         
##    Detection Prevalence : 0.498         
##       Balanced Accuracy : 0.747         
##                                         
##        'Positive' Class : 0             
## 
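
The statistics in the confusion matrix above can be recomputed directly from its four counts. The sketch below treats class 0 as the positive class, as the output does. Note that the accuracy (0.683) is below the no-information rate (0.756), which is the accuracy obtained by always predicting the majority class.

```r
# Counts copied from the output; rows are predictions, columns the reference.
cm <- matrix(c(686, 101, 920, 1518), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # true class 0 predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])   # true class 1 predicted as 1
ppv         <- cm["0", "0"] / sum(cm["0", ])   # predicted 0 that are truly 0
npv         <- cm["1", "1"] / sum(cm["1", ])   # predicted 1 that are truly 1
nir         <- max(colSums(cm)) / sum(cm)      # share of the majority class
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              nir = nir, balanced = balanced), 3))
```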
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                        3224       3584             
## budget      1       40      3223       3545  3.2e-10 ***
## popularity  1      619      3222       2925  < 2e-16 ***
## runtime     1        0      3221       2925  0.99475    
## score       1       28      3220       2897  1.2e-07 ***
## vote        1      109      3219       2788  < 2e-16 ***
## genres     17       45      3202       2743  0.00021 ***
## company     5       73      3197       2670  2.4e-14 ***
## season      3       14      3194       2656  0.00348 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s compare genres and company to see which factor contributes more to the model.

## 
## Call:
## glm(formula = y ~ budget + popularity + score + vote + genres, 
##     family = "binomial", data = movie_nd[-c(1)])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.487   0.000   0.309   0.769   1.765  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.65e+00   3.98e-01   -4.15  3.4e-05 ***
## budget                -1.38e-08   1.90e-09   -7.29  3.0e-13 ***
## popularity             2.32e-02   9.71e-03    2.39  0.01702 *  
## score                  2.40e-01   6.49e-02    3.70  0.00021 ***
## vote                   2.34e-03   3.31e-04    7.06  1.7e-12 ***
## genresAdventure        3.25e-02   2.06e-01    0.16  0.87496    
## genresAnimation        3.89e-01   3.31e-01    1.18  0.23907    
## genresComedy           5.09e-01   1.52e-01    3.36  0.00078 ***
## genresCrime            9.62e-03   2.45e-01    0.04  0.96867    
## genresDocumentary      5.08e-01   4.32e-01    1.18  0.23956    
## genresDrama            1.80e-02   1.50e-01    0.12  0.90431    
## genresFamily           5.23e-01   4.49e-01    1.16  0.24408    
## genresFantasy          4.55e-01   3.51e-01    1.30  0.19498    
## genresHistory          5.63e-01   6.32e-01    0.89  0.37299    
## genresHorror           1.01e+00   2.60e-01    3.87  0.00011 ***
## genresMusic            4.25e-01   5.34e-01    0.80  0.42581    
## genresMystery         -4.13e-02   5.12e-01   -0.08  0.93561    
## genresRomance          5.27e-01   3.44e-01    1.53  0.12585    
## genresScience Fiction -2.29e-02   3.71e-01   -0.06  0.95085    
## genresThriller        -1.59e-01   2.70e-01   -0.59  0.55721    
## genresWar             -1.09e+00   5.89e-01   -1.85  0.06418 .  
## genresWestern          1.73e+00   8.12e-01    2.13  0.03331 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3584.1  on 3224  degrees of freedom
## Residual deviance: 2742.8  on 3203  degrees of freedom
## AIC: 2787
## 
## Number of Fisher Scoring iterations: 8

KNN

Season

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1049 
## 
##  
##              | season_knn 
## test2$season |      Fall |    Spring |    Summer |    Winter | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##         Fall |       113 |        71 |        70 |        73 |       327 | 
##              |     0.346 |     0.217 |     0.214 |     0.223 |     0.312 | 
##              |     0.365 |     0.300 |     0.268 |     0.303 |           | 
##              |     0.108 |     0.068 |     0.067 |     0.070 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##       Spring |        51 |        48 |        75 |        40 |       214 | 
##              |     0.238 |     0.224 |     0.350 |     0.187 |     0.204 | 
##              |     0.165 |     0.203 |     0.287 |     0.166 |           | 
##              |     0.049 |     0.046 |     0.071 |     0.038 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##       Summer |        82 |        68 |        59 |        72 |       281 | 
##              |     0.292 |     0.242 |     0.210 |     0.256 |     0.268 | 
##              |     0.265 |     0.287 |     0.226 |     0.299 |           | 
##              |     0.078 |     0.065 |     0.056 |     0.069 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##       Winter |        64 |        50 |        57 |        56 |       227 | 
##              |     0.282 |     0.220 |     0.251 |     0.247 |     0.216 | 
##              |     0.206 |     0.211 |     0.218 |     0.232 |           | 
##              |     0.061 |     0.048 |     0.054 |     0.053 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       310 |       237 |       261 |       241 |      1049 | 
##              |     0.296 |     0.226 |     0.249 |     0.230 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## 
## 

k = 35

Genres

k = 27

Company

k = 17

Random Forests

## 'data.frame':    2669 obs. of  3 variables:
##  $ revenue: num  2.64e+08 3.37e+08 1.00e+07 3.66e+08 1.28e+08 ...
##  $ year   : num  1995 1995 1995 1995 1995 ...
##  $ quarter: chr  "Q3" "Q2" "Q4" "Q2" ...
## # A tibble: 87 x 3
## # Groups:   year [22]
##     year quarter    revenue
##    <dbl> <fct>        <dbl>
##  1  1995 Q1       252877967
##  2  1995 Q2      2630639816
##  3  1995 Q3      1562025856
##  4  1995 Q4      1637810344
##  5  1996 Q1       399881330
##  6  1996 Q2      3002163064
##  7  1996 Q3       905547568
##  8  1996 Q4      2465830133
##  9  1997 Q1       722506511
## 10  1997 Q2      2052763938
## # … with 77 more rows
##          Qtr1     Qtr2     Qtr3     Qtr4
## 2011 3.16e+09 7.61e+09 4.39e+09 5.30e+09
## 2012 3.88e+09 8.56e+09 4.42e+09 6.92e+09
## 2013 3.86e+09 7.01e+09 5.34e+09 6.94e+09
## 2014 4.90e+09 6.67e+09 4.18e+09 8.30e+09
## 2015 3.27e+09 9.24e+09 5.34e+09 4.62e+09
## 2016 4.22e+09 7.70e+09 2.53e+09

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.55e+09 -6.35e+08  2.92e+08  0.00e+00  9.27e+08  9.67e+08
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -1.05e+09 -4.67e+08 -8.91e+07 -3.50e+06  4.29e+08  1.99e+09         4

## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = movie.ts)
## 
## Smoothing parameters:
##  alpha: 0.0475
##  beta : 0.154
##  gamma: 0.313
## 
## Coefficients:
##         [,1]
## a   4.70e+09
## b   5.96e+07
## s1 -1.15e+09
## s2  1.73e+09
## s3  3.53e+08
## s4  1.03e+09
## [1] 4.17e+19
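
To show how the coefficients above turn into forecasts: in an additive Holt-Winters model, the forecast h quarters ahead is the level plus h times the trend plus the matching seasonal term. The sketch below applies that formula to the rounded coefficients printed above, so the results are approximate. Which calendar quarter corresponds to `s1` depends on where `movie.ts` ends, which the output does not show.

```r
# Rounded coefficients copied from the Holt-Winters output above.
a <- 4.70e9                                 # level
b <- 5.96e7                                 # trend per quarter
s <- c(-1.15e9, 1.73e9, 3.53e8, 1.03e9)     # seasonal terms s1..s4

# Additive forecast for the next four quarters: a + h * b + s_h.
h  <- 1:4
fc <- a + h * b + s[h]
print(signif(fc, 4))
```

On a fitted object the same numbers would come from `predict(fit, n.ahead = 4)`, where `fit` is a hypothetical name for the result of `HoltWinters(movie.ts)`.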